Quick links (more details at introduction):

Assignment used the base docker image acessible here.

Chosen study: Serotonin-induced hyperactivity in SSRI-resistant major depressive disorder patient-derived neurons(Vadodaria et al., 2019).

GEO data set: GEO accession GSE125664(Vadodaria KC, 2019)

1. Introduction: A1 summary

1.1 Selection of an expression data set and software tools

Previously, in A1, we have found an appropriate study (in terms of RNA expression analysis and clinical relevance) by searching RNAseq data set available at [Gene Expression Omnibus repository - GEO] (www.ncbi.nlm.nih.gov/geo)(“Gene Expression Omnibus,” n.d.).

We used Geometadb(Zhu et al., 2008) to perform the search of GEO data sets and RSQLite(Müller et al., 2020) and GEOquerry(Davis & Meltzer, 2007) package to perform the queries using a downloaded copy of GEO contents’ metadata. These packages were installed, when needed through BiocManager(Morgan, 2019) functionality. knitr(Xie, 2021) package was later used to report data statistics and edgeR to aid on modeling and normalization of data.

After 4 search iterations we have chosen the following study: [Serotonin-induced hyperactivity in SSRI-resistant major depressive disorder patient-derived neurons] (https://pubmed.ncbi.nlm.nih.gov/30700803/)(Vadodaria et al., 2019) in which the authors tried to elucidate the common clinical challenge of patients suffering from Major Depressive Disorder (MDD) who fail or respond only partially to SSRI therapy (aka non-remitters). We currently have no biomarkers to identify these patients. Moreover, is not known the specific mechanism underpinning SSRI and similar SNRI-induced suicidality.

It also provided the raw mRNA counts for analysis. The controls are represented by expression data in forebrain neuron lines developed from Induced pluripotent stem cells (iPSCs) removed through non-CNS biopsy of healthy patients exposed to SSRIs. The test group involve the same cells from MDD patients which suffered from SSRI treatment resistance and MDD patients which responded to SSRI treatment.

The sample size in this study is 9, with adequate control and 2 testing groups (MDD with an without SSRI-based treatment resistance).

1.2. Downloading the data, cleaning and compute summary statistics

The data for this study is contained in one supplemental .csv file called “GSE125664_Vadodaria_MDDNeurons_RawCounts”:

Platform title: Illumina HiSeq 2500 (Homo sapiens)

Submission date: Mar 14 2013

Last update date: Mar 27 2019

Organisms: Homo sapiens

Snapshot of data first 10 mapped genes at this stage:

kable(SSRI_exp[1:10,1:10], format = "html")
gname H_neurons_1 H_neurons_2 H_neurons_3 NR_neurons_1 NR_neurons_2 NR_neurons_3 R_neurons_1 R_neurons_2 R_neurons_3
A1BG 220 219 205 211 202 134 270 163 220
A1BG-AS1 106 98 96 91 165 128 104 130 120
A1CF 0 0 1 0 0 0 0 0 1
A2M 9441 18 4 2541 336 488 2396 1584 72
A2M-AS1 9 12 9 22 36 5 26 31 7
A2ML1 11 0 1 6 53 23 21 10 1
A2MP1 4 0 0 1 0 0 0 1 0
A3GALT2 4 1 1 3 10 7 5 6 2
A4GALT 85 48 78 29 1 0 17 0 43
A4GNT 0 1 0 0 1 0 0 0 0
dim(SSRI_exp)
## [1] 22351    10

And we have 22351 genes with measures for 3 controls, 3 non-resistant-to-SSRI MDD patients and 3 resistant-to-SSRI MDD patients.

Counts for each gene show each gene with the frequency of exactly one, except for some miscoded names.

28 genes contained miscoded names (they expressed dates instead of valid identifiers) and we will have to remove them.

Each mRNA reading was already mapped to unique genes expressed as HUGO Gene Nomenclature Committee symbols) so we proceeded to calculate counts per million reads mapped (cmp). All the counts with less than 9 counts where removed and that removed 9350 gene observations. There after that, and after the cleaning the data, there were no duplicates and all the symbols where unique symbols that complied with HUGO Gene Nomenclature Committee. After all removals and manipulations, the dataset coverage was 0.59.

We have kept 12973 gene counts, meaning we had to filter out 0.4188505 of our gene-mapped RNA seq observations

1.3. Normalization

We used TMM (Trimmed Mean of M-values) normalization. For most samples the distribution curve pre and post-normalization is very similar, however there is a noticiable change towards the mean (a bit higher than 5) for the second control (H_neuron_2) and, interestingly, for 2 of the 3 nonremitter patients.

data2plot <- log2(cpm(SSRI_exp_filtered[,2:10]))

SSRI_matrix <- as.matrix(SSRI_exp_filtered[,-1])

rownames(SSRI_matrix) <- SSRI_exp_filtered$gname

# inspired by Yi Fei Huang approach in terms of merging samples into groups

categories <- c("healthy", "healthy", "healthy",
             "nonremitter", "nonremitter", "nonremitter",
             "remitter", "remitter", "remitter")

# inspired by lecture 4 slides 51 and 52

d = DGEList(counts=SSRI_matrix, group=categories)

# Calculation of normalization factors

d = calcNormFactors(d)

normalized_counts <- cpm(d)

2. Differential Gene Expression analysis

2.1 Model design and Differential Expression Significance Calculation with correction

In this section, we will used the normalized data produced above to test for statistically significant expression differences for each of the mapped genes, between our 3 groups (we will call it status) of interest in our designed model: healthy patients exposed to SSRI (control), non-remitters MDD exposed to SSRI (test 2) and remitter MDD exposed to SSRI (test 2).

An expansion of this model includes 2 groups (we will call this attribute “affected”), healthy (ND) and affected by disease (D), although this expansion wouldn’t add granularity to the assessment of different subtypes of MDD (refractive or not to treatment), it could help in increasing specificity to the bayes differential expression significance calculation.

# First we create our model

samples <- data.frame(status = c("H", "H", "H", "NR", "NR","NR", "R", "R", "R"),
                      affected = c("ND", "ND", "ND", "D", "D","D", "D", "D", "D"))

rownames(samples) <- colnames(SSRI_exp[-1])
# model on affected Vs healthy
model_design <- model.matrix(~samples$affected)
# model on status of remission to SSRI (remitters Vs non-remitters)
model_design_2 <- model.matrix(~samples$status)

# compound model taking into account both conditions will not yield independent coefficients:
# model_design_compound <- model.matrix(~samples$affected + samples$status)

Now we will fit our normalized data to each of the built models.

2.1.1 Using Limma

# As inspired by lecture 6, using limma, doesn't add result in significantly different
# restults compared with chosen approach (see following code chunk):
expressionMatrix <- as.matrix(normalized_counts)
rownames(expressionMatrix) <- SSRI_exp_filtered$gname
colnames(expressionMatrix) <- colnames(SSRI_exp_filtered)[2:10]
minimalSet <- ExpressionSet(assayData=expressionMatrix)
# Fitting for Affected Vs Controls
fit <- lmFit(minimalSet, model_design)
# Fitting for type of SSRI response (remitters and non-remitters)
fit_resp <- lmFit(minimalSet, model_design_2)

Using empirical Bayes method:

fit_2 <- eBayes(fit_resp,trend=TRUE)
fit_resp_2 <- eBayes(fit_resp,trend=TRUE)

Now we can calculate the differential expression and apply the correction for multiple hypothesis test through Benjamni-Hochberg method (less than 1 minute processing time).

#calculate differential expression, inspired by lecture 6
topfit <- topTable(fit_2, coef=ncol(model_design), adjust.method = "BH",
number = nrow(expressionMatrix))

topfit_resp <- topTable(fit_resp_2, coef=ncol(model_design_2), adjust.method = "BH",
number = nrow(expressionMatrix))

gene_names <- data.frame(SSRI_exp[,1])
colnames(gene_names) <- c("hgnc_symbol")
output_hits <- merge(gene_names, topfit, by.y=0, by.x=1, all.y=TRUE)
output_hits_2 <- merge(gene_names, topfit_resp, by.y=0, by.x=1, all.y=TRUE)

#sort by p-value
output_hits <- output_hits[order(output_hits$P.Value),]
output_hits_2 <- output_hits_2[order(output_hits_2$P.Value),]

And the results for each of the models, with and without Benjamni-Hochberg corrections for multiple hypothesis tests are as following.

For Healthy Vs MDD patients:

kable(output_hits[1:10,],type="html",row.names = FALSE)
hgnc_symbol logFC AveExpr t P.Value adj.P.Val B
ARHGAP20 31.790074 13.690482 9.103935 0.0000085 0.0762015 -4.086061
FUT9 297.971376 143.629692 8.744316 0.0000117 0.0762015 -4.090963
TGFB2 144.836367 69.823697 8.082447 0.0000220 0.0952138 -4.101465
PCDH17 161.793691 95.354410 7.666083 0.0000334 0.1082782 -4.109250
CPLX1 17.053081 11.094874 7.232782 0.0000525 0.1236884 -4.118530
NXPH4 18.140995 17.926544 7.151887 0.0000572 0.1236884 -4.120412
PCDHA12 9.388553 8.427546 6.575008 0.0001084 0.1769541 -4.135432
MDK -268.541560 209.302062 -6.352621 0.0001401 0.1769541 -4.142065
TRIM9 190.655235 140.332625 6.250673 0.0001580 0.1769541 -4.145284
CENPW -25.163268 15.527767 -6.182821 0.0001712 0.1769541 -4.147491

For Remitter status:

kable(output_hits_2[1:10,],type="html",row.names = FALSE)
hgnc_symbol logFC AveExpr t P.Value adj.P.Val B
FAM222A 16.581813 12.503918 6.505383 0.0001174 0.6138379 -4.485416
NHSL2 5.669770 4.068259 6.123237 0.0001838 0.6138379 -4.488182
CENPW -21.845698 15.527767 -5.367667 0.0004719 0.6138379 -4.494934
TROAP -14.185957 11.867062 -5.170270 0.0006116 0.6138379 -4.497039
AUTS2 189.589420 242.091556 5.079170 0.0006906 0.6138379 -4.498066
SHROOM2 53.631027 45.223557 5.009858 0.0007581 0.6138379 -4.498871
KIFC1 -27.544679 25.241187 -4.950808 0.0008213 0.6138379 -4.499575
E2F8 -6.080405 4.980789 -4.922310 0.0008537 0.6138379 -4.499920
ATP1A3 354.375848 301.299852 4.910043 0.0008682 0.6138379 -4.500070
NXPH4 11.796362 17.926544 4.650586 0.0012431 0.6138379 -4.503414

Applying a threshold of p-value < 0.05 and computing how many genes passed the test:

length(which(output_hits$P.Value < 0.05))
## [1] 2612
length(which(output_hits_2$P.Value < 0.05))
## [1] 983
  • 2612 genes passed the test of significance for change in transcription factor for the healthy VS MDD model
  • 983 genes passed the test of significance for change in transcription factor for the "type-of-response-to-SSRI model

Now computing how many mapped genes passed the p-value < 0.05 with Benjamni-Hochberg correction for multiple testing

length(which(output_hits$adj.P.Val < 0.05))
## [1] 0
length(which(output_hits_2$adj.P.Val < 0.05))
## [1] 0
  • No gene passed the test of significance for change in transcription factor for the healthy VS MDD model with correction
  • No gene passed the test of significance for change in transcription factor for the "type-of-response-to-SSRI model with correction

2.1.2 EdgeR Simple approach using one model

In this approach we will use the model that differentiates between MDD patients that were non-responsive to SSRI (non-remitter or NR) from responsive patients (remitter or R), since this is main approach took by the paper’s authors in its published version’s transcriptome analysis (see (Vadodaria et al., 2019), page 801, specially fig 4.b)

d = DGEList(counts=normalized_counts, group=samples$statusNR)

# Dispersions
d <- estimateDisp(d, model_design_2)

# Computing normalization factors
d <- calcNormFactors(d)

# fitting models
fit <- glmQLFit(d, model_design_2)

# Computing differential expression
qlf.pos_vs_neg <- glmQLFTest(fit, coef='samples$statusNR')
# Results
qlf_output_hits <- topTags(qlf.pos_vs_neg, sort.by = "PValue", n = nrow(SSRI_exp))
# Passing p-value < 0.05?
length(which(qlf_output_hits$table$PValue < 0.05))
## [1] 2536
# Passing correction
length(which(qlf_output_hits$table$FDR < 0.05))
## [1] 0
## [1] 0
  • 2536 genes passed the test of significance for change in transcription
  • No gene passed the test of significance for change in transcription

2.1.3 Using paper’s modeling

Since the previous 2 modeling failed to yield significant results post-correction, we decided to use the model applied by the source study (Vadodaria et al., 2019), page 801. Thus, we design 2 distinct but symmetrical models that will assess remitters MDD patient’s cells VS Healthy control cells and remitters MDD patient’s cells VS non-remitters MDD patient’s cells separately.

samples <- data.frame(status = c(“H”, “H”, “H”, “NR”, “NR”,“NR”, “R”, “R”, “R”), affected = c(“ND”, “ND”, “ND”, “D”, “D”,“D”, “D”, “D”, “D”))

# Creating model for non-remitters VS control assessment
samples_NRxH <- data.frame(status = c("H", "H", "H", "NR", "NR","NR"))
rownames(samples_NRxH) <- colnames(SSRI_exp)[2:7]
model_design_NRxH <- model.matrix(~samples_NRxH$status)

# Creating model for non-remitters VS remitters assessment
samples_NRxR <- data.frame(status = c("NR", "NR","NR", "R", "R", "R"))
rownames(samples_NRxR) <- colnames(SSRI_exp)[5:10]
model_design_NRxR <- model.matrix(~samples_NRxR$status)

Now we apply the each model separately to the appropriate data set:

d_NRxH = DGEList(counts=normalized_counts[,1:6], group=samples_NRxH$statusNR)
d_NRxR = DGEList(counts=normalized_counts[,4:9], group=samples_NRxR$statusNR)


# Dispersions
d_NRxH <- estimateDisp(d_NRxH, model_design_NRxH)
d_NRxR <- estimateDisp(d_NRxR, model_design_NRxR)


# Computing normalization factors
d_NRxH <- calcNormFactors(d_NRxH)
d_NRxR <- calcNormFactors(d_NRxR)


# fitting models
fit_NRxH <- glmQLFit(d_NRxH, model_design_NRxH)
fit_NRxR <- glmQLFit(d_NRxR, model_design_NRxR)


# Computing differential expression
qlf.pos_vs_neg_NRxH <- glmQLFTest(fit_NRxH, coef='samples_NRxH$statusNR')
qlf.pos_vs_neg_NRxR <- glmQLFTest(fit_NRxR, coef='samples_NRxR$statusR')
# Results, now inspired by lecture 7
qlf_output_hits_RNxH <- topTags(qlf.pos_vs_neg_NRxH, sort.by = "PValue", n = nrow(SSRI_exp))
qlf_output_hits_RNxR <- topTags(qlf.pos_vs_neg_NRxR, sort.by = "PValue", n = nrow(SSRI_exp))

# Here we check how many mapped genes passed the 0.05 significance threshold with and without correction
# for the comparison remitter Vs control
length(which(qlf_output_hits_RNxH$table$PValue < 0.05))
## [1] 2306
length(which(qlf_output_hits_RNxH$table$FDR < 0.05))
## [1] 0
# Here we check how many mapped genes passed the 0.05 significance threshold with and without correction
# for the comparison non-remitters Vs remitters
length(which(qlf_output_hits_RNxR$table$PValue < 0.05))
## [1] 732
length(which(qlf_output_hits_RNxR$table$FDR < 0.05))
## [1] 0

Although the results indicate the need for enrichment set analysis (which is one of the purpose of this assignment, as follows), this results are now more comparable with the source literature(Vadodaria et al., 2019), in which, at least for the set of genes presented there, the differential expression between remitters and controls is larger than the one between remitters and non-remitters.

Another thresholding, now with p=0.09 is now performed, despite not being considered as a strong evidence of up or downregulation. We decided to perform this aditional less stringent comparision in order to compare with the original paper’s finding of HTR set (specifically HTR2A) are up-regulated in non-remitters compared to remmiters.

length(which(qlf_output_hits_RNxR$table$PValue < 0.09))
## [1] 1275

And it increases drastically the amount of hits, compatible with what the paper presented, although we won’t proceed further with the 0.09 threshold in order to keep relevance to any eventual finding.

2.2 General and highlighted MA plot

Our top results through the last model were as follows, for the remitters VS control comparison:

kable(topTags(qlf.pos_vs_neg_NRxH), type="html")
logFC logCPM F PValue FDR
ARHGAP20 3.713885 4.306479 77.31141 0.0000214 0.1708668
SLC4A4 3.099146 6.575306 61.05590 0.0000504 0.1708668
SLC7A8 -3.278140 6.041278 59.35811 0.0000558 0.1708668
BAMBI -3.008148 5.296464 49.88476 0.0001035 0.1708668
ACTN2 4.031646 5.343578 49.59019 0.0001056 0.1708668
TGFB2 3.099359 6.505177 46.58973 0.0001314 0.1708668
CPLX1 2.446575 3.788883 45.46104 0.0001431 0.1708668
ZIC5 -3.337702 6.662613 42.84900 0.0001756 0.1708668
SALL3 2.734075 5.088882 40.29411 0.0002169 0.1708668
SORL1 2.697899 5.687669 39.10434 0.0002402 0.1708668
x
BH
x
samples_NRxH$statusNR
x
glm

And for the non-remitters VS remitters comparision:

kable(topTags(qlf.pos_vs_neg_NRxR), type="html")
logFC logCPM F PValue FDR
SLC7A8 3.449887 6.203708 90.94035 0.0000397 0.5152664
ARHGAP20 -2.931650 4.379729 60.34789 0.0001410 0.5894561
C1QTNF6 2.029465 6.054122 54.38340 0.0001932 0.5894561
MEF2C -4.649641 7.619806 52.41456 0.0002158 0.5894561
CNTN5 2.584843 3.021673 51.14874 0.0002322 0.5894561
KAL1 -3.726279 7.323623 48.47498 0.0002726 0.5894561
TGFB2 -2.589404 6.573207 45.33219 0.0003326 0.6122787
ACTN2 -3.719823 5.366947 39.97009 0.0004812 0.6122787
PLEKHA7 1.632958 4.925225 38.57605 0.0005335 0.6122787
MID1 2.077457 7.057827 37.98509 0.0005578 0.6122787
x
BH
x
samples_NRxR$statusR
x
glm

In order to have a more general view of the differences in expression among non-remitters and controls and among remitters and non-remitters we decided to build an MA plot for each of these assessments. For that, we will use plotMA from the limma package(Ritchie et al., 2015). We will also use the argument at the function (see documentation), to include the controls.

Comparing first pair of samples (remmitter Vs non-remmiters and its control)

plotMA(log2(SSRI_exp[,c(5,8)]), ylab="M - ratio log expression", main="MA plot R VS NR 1st pair", status = SSRI_exp[,c(2)])

Legend: Differential expression profile on the first pair of samples (remmitter Vs non-remmiters and its control)

Second pair:

plotMA(log2(SSRI_exp[,c(6,9)]), ylab="M - ratio log expression", main="MA plot R VS NR 2nd pair", status = SSRI_exp[,c(3)])

Legend: Differential expression profile on the second pair of samples (remmitter Vs non-remmiters and its control)

Third pair:

plotMA(log2(SSRI_exp[,c(7,10)]), ylab="M - ratio log expression", main="MA plot R VS NR 3rd pair", status = SSRI_exp[,c(4)])

Legend: Differential expression profile on the third pair of samples (remmitter Vs non-remmiters and its control)

No useful insight is prompt when such large number of gene mapping is available so we decided to highlight only the group of genes related to the paper’s biological hypothesis, concerning the following HTR family genes, which interestingly are more actively differentially expressed than the average:

plotMA(log2(SSRI_exp[7844:7866,c(5,8)]), ylab="M - ratio log expression", main="HTR family MA plot R VS NR 1st pair", status = SSRI_exp[,c(2)])

Legend: Differential expression profile on the first pair of samples (remmitter Vs non-remmiters and its control for the following HTR family members:

1)HTR1A 2)HTR1B 3)HTR1D 4)HTR1E 5)HTR1F 6)HTR2A

7)HTR2B 8)HTR2C 9)HTR3A 10)HTR3B 11)HTR3C 12)HTR3D 13)HTR3E

plotMA(log2(SSRI_exp[7844:7866,c(6,9)]), ylab="M - ratio log expression", main="HTR family MA plot R VS NR 2nd pair", status = SSRI_exp[,c(3)])

Legend: Differential expression profile on the second pair of samples (remmitter Vs non-remmiters and its control for the following HTR family members:

1)HTR1A 2)HTR1B 3)HTR1D 4)HTR1E 5)HTR1F 6)HTR2A

7)HTR2B 8)HTR2C 9)HTR3A 10)HTR3B 11)HTR3C 12)HTR3D 13)HTR3E

plotMA(log2(SSRI_exp[7844:7866,c(7,10)]), ylab="M - ratio log expression", main="HTR family MA plot R VS NR 3rd pair", status = SSRI_exp[,c(4)])

Legend: Differential expression profile on the third pair of samples (remmitter Vs non-remmiters and its control for the following HTR family members:

1)HTR1A 2)HTR1B 3)HTR1D 4)HTR1E 5)HTR1F 6)HTR2A

7)HTR2B 8)HTR2C 9)HTR3A 10)HTR3B 11)HTR3C 12)HTR3D 13)HTR3E

2.3 Heat map

In order to visualize our mostly diferentially expressed genes we used a heat map as provided by the ComplexHeatmap package(Gu et al., 2016)), also using circlize package (Gu et al., 2014). However the processing time was long for the whole dataset using those tools.

# As inspired in Lecture 6

# Creating matrix
heatmap_matrix <- normalized_counts[,2:ncol(normalized_counts)]

rownames(heatmap_matrix) <- rownames(normalized_counts)

colnames(heatmap_matrix) <- colnames(normalized_counts[,2:ncol(normalized_counts)])

heatmap_matrix <- t(scale(t(heatmap_matrix)))

# Setting colours
if(min(heatmap_matrix) == 0){
  heatmap_col = colorRamp2(c( 0, max(heatmap_matrix)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix), 0,
max(heatmap_matrix)), c("blue", "white", "red"))
}

# Creating heat map
current_heatmap <- Heatmap(as.matrix(heatmap_matrix),
show_row_dend = TRUE,show_column_dend = TRUE,
col=heatmap_col,show_column_names = TRUE,
show_row_names = FALSE,show_heatmap_legend = TRUE)
current_heatmap

So we decided to use the limma package(Ritchie et al., 2015) tools instead, using a threshold of 0.01 for significance for a cleaner view.

First for non-remitters VS controls:

# As inspired by lecture 6

heatmap_matrix <- normalized_counts[,1:6]

rownames(heatmap_matrix) <- rownames(normalized_counts)

colnames(heatmap_matrix) <- colnames(normalized_counts[,1:6])

heatmap_matrix <- t(scale(t(heatmap_matrix)))

top_hits <- output_hits_2$hgnc_symbol[output_hits_2$P.Value<0.01]

heatmap_matrix_tophits <- t(scale(t(heatmap_matrix[which(rownames(heatmap_matrix)
                          %in% top_hits),])))

# UNCOMMENT THIS TO HIDE CLUSTERING
# This is to specifically avoid clustering by condition
# heatmap_matrix_tophits<- heatmap_matrix_tophits[,
# c(grep(colnames(heatmap_matrix_tophits),pattern = "1"),
# grep(colnames(heatmap_matrix_tophits),pattern = "2"),
# grep(colnames(heatmap_matrix_tophits),pattern = "3"))]
  
if(min(heatmap_matrix_tophits) == 0){
  heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
  c( "white", "red"))
} else {
  heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, 
                          max(heatmap_matrix_tophits)), 
                          c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE, show_row_dend = TRUE,
cluster_columns = FALSE,show_column_dend = FALSE,
col=heatmap_col,show_column_names = TRUE,
show_row_names = FALSE,show_heatmap_legend = TRUE)

current_heatmap

Legend: The conditions here (status of non-remitter or control) are intentionally clustered together due to the order of the matrix (cluster argument still set to FALSE) used on the heat map (uncomment unclustering code to see it otherwise). One can notice that the bottom half of the over expressed sets on the non-remitter band is more active at remitters neurons 2 and 3 but still overexpressed across all test samples.

Now for non-remitters VS remitters:

# As inspired by lecture 6

heatmap_matrix <- normalized_counts[,4:9]

rownames(heatmap_matrix) <- rownames(normalized_counts)

colnames(heatmap_matrix) <- colnames(normalized_counts[,4:9])

heatmap_matrix <- t(scale(t(heatmap_matrix)))

top_hits <- output_hits_2$hgnc_symbol[output_hits_2$P.Value<0.01]

heatmap_matrix_tophits <- t(scale(t(heatmap_matrix[which(rownames(heatmap_matrix)
                          %in% top_hits),])))

# UNCOMMENT THIS TO HIDE CLUSTERING
# This is to specifically avoid clustering by condition
# heatmap_matrix_tophits<- heatmap_matrix_tophits[,
# c(grep(colnames(heatmap_matrix_tophits),pattern = "1"),
# grep(colnames(heatmap_matrix_tophits),pattern = "2"),
# grep(colnames(heatmap_matrix_tophits),pattern = "3"))]
  
if(min(heatmap_matrix_tophits) == 0){
  heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
  c( "white", "red"))
} else {
  heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, 
                          max(heatmap_matrix_tophits)), 
                          c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE, show_row_dend = TRUE,
cluster_columns = FALSE,show_column_dend = FALSE,
col=heatmap_col,show_column_names = TRUE,
show_row_names = FALSE,show_heatmap_legend = TRUE)

current_heatmap

The conditions here (status of remitter or non-remitter) are intentionally clustered together due to the order of the matrix used on the heat map (cluster argument still set to FALSE). One can notice that except for a mid-range band across all NR’s and except for the lower half of NR_neuron_1, the non-remitter are active than the correspondent band on the remitter neurons.

3. Thresholded over-representation analysis

3.1 Quantifying significantly up and down-regulated genes

First comparing non-remitters and controls:

## up regulated genes:
length(which(qlf_output_hits_RNxH$table$PValue < 0.05
& qlf_output_hits_RNxH$table$logFC > 0))
## [1] 1345
## down regulated genes:
length(which(qlf_output_hits_RNxH$table$PValue < 0.05
& qlf_output_hits_RNxH$table$logFC < 0))
## [1] 961

We have 1345 genes significantly up-regulated in non-remitters comparing to controls, and 961 down-regulated.

Now comparing non-remitters and remitters:

## up regulated genes:
length(which(qlf_output_hits_RNxR$table$PValue < 0.05
& qlf_output_hits_RNxR$table$logFC > 0))
## [1] 379
## down regulated genes:
length(which(qlf_output_hits_RNxR$table$PValue < 0.05
& qlf_output_hits_RNxR$table$logFC < 0))
## [1] 353

We have 379 genes significantly up-regulated in non-remitters comparing to controls, and 353 down-regulated, which is compatible with finding on paper, where the delta of expression among MDD patients is decreased when compared to remitters and controls.

3.2 Creating thresholded list

Here we will create a thresholded list, initially with the remitters dataset compared to the non remitters dataset, in which our rank will consist in the following product: -log(p-value) * logFC. We then create 3 files to contain, separately:

  1. all significantly differentially expressed genes together
  2. the up-regulated significantly differentially expressed genes
  3. the down-regulated significantly differentially expressed genes
# Inspired by lecture 7

# Attached name to gene results matrix
qlf_output_hits_withgn_RNxR <- merge(SSRI_exp[,1], qlf_output_hits_RNxR, by.x=1, by.y=0)

colnames(qlf_output_hits_withgn_RNxR) <- c("gname", colnames(qlf_output_hits_withgn_RNxR)[-1])

# Calculating our rank as -log(p-value) * logFC
qlf_output_hits_withgn_RNxR[,"rank"] <- -log(qlf_output_hits_withgn_RNxR$PValue,base=10) * 
                                         sign(qlf_output_hits_withgn_RNxR$logFC)

# Ordering per rank
qlf_output_hits_withgn_RNxR <- qlf_output_hits_withgn_RNxR[order(qlf_output_hits_withgn_RNxR$rank),]

# Separating up-regulated genes with threshold of 0.05
# Note: the previous calculation modeled for FC R/NR
# here, we are interested in FC NR/R
upregulated_genes_RNxR <- qlf_output_hits_withgn_RNxR$gname[which(qlf_output_hits_withgn_RNxR$PValue < 0.05
& qlf_output_hits_withgn_RNxR$logFC < 0)]

# Separating down-regulated genes with threshold of 0.05
downregulated_genes_RNxR <- qlf_output_hits_withgn_RNxR$gname[which(qlf_output_hits_withgn_RNxR$PValue 
                                                    < 0.05 & qlf_output_hits_withgn_RNxR$logFC > 0)]

Now doing the same to compare non-remitters to remitters:

# Inspired by lecture 7

# Attached name to gene results matrix
qlf_output_hits_withgn_RNxH <- merge(SSRI_exp[,1], qlf_output_hits_RNxH, by.x=1, by.y=0)

colnames(qlf_output_hits_withgn_RNxH) <- c("gname", colnames(qlf_output_hits_withgn_RNxH)[-1])

# Calculating our rank as -log(p-value) * logFC
qlf_output_hits_withgn_RNxH[,"rank"] <- -log(qlf_output_hits_withgn_RNxH$PValue,base=10) * 
                                         sign(qlf_output_hits_withgn_RNxH$logFC)

# Ordering per rank
qlf_output_hits_withgn_RNxH <- qlf_output_hits_withgn_RNxH[order(qlf_output_hits_withgn_RNxH$rank),]

# Separating up-regulated genes with threshold of 0.05
upregulated_genes_RNxH <- qlf_output_hits_withgn_RNxH$gname[which(qlf_output_hits_withgn_RNxH$PValue < 0.05
& qlf_output_hits_withgn_RNxH$logFC > 0)]

# Separating down-regulated genes with threshold of 0.05
downregulated_genes_RNxH <- qlf_output_hits_withgn_RNxH$gname[which(qlf_output_hits_withgn_RNxH$PValue 
                                                    < 0.05 & qlf_output_hits_withgn_RNxH$logFC < 0)]

Save tables for later use if docker port connection is properly configured:

dir.create('output', showWarnings=FALSE)

# Saving up-regulated genes NRxR
write.table(x=upregulated_genes_RNxR, file=file.path("./","RNxR_upregulated_genes.txt"),sep="\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

# Saving down-regulated genes NRxR
write.table(x=downregulated_genes_RNxR, file=file.path("./","RNxR_downregulated_genes.txt"),sep="\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

# Saving complete list NRxR
write.table(x=data.frame(genename=qlf_output_hits_withgn_RNxR$gname,
                         F_stat=qlf_output_hits_withgn_RNxR$rank
                         ),
                         file=file.path("./","RNxR_ranked_genelist.txt"),
                         sep = "\t",
                         row.names = FALSE,col.names = FALSE,quote = FALSE)

# Saving up-regulated genes RNxH
write.table(x=upregulated_genes_RNxH, file=file.path("./","RNxH_upregulated_genes.txt"),sep="\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

# Saving down-regulated genes RNxH
write.table(x=downregulated_genes_RNxH, file=file.path("./","RNxH_downregulated_genes.txt"),sep="\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

# Saving complete list RNxH
write.table(x=data.frame(genename=qlf_output_hits_withgn_RNxH$gname,
                         F_stat=qlf_output_hits_withgn_RNxH$rank
                         ),
                         file=file.path("./","RNxH_ranked_genelist.txt"),
                         sep = "\t",
                         row.names = FALSE,col.names = FALSE,quote = FALSE)

3.3 Gene List Function Enrichment (over-representation) Analysis

We will be using G:Profiler web server(Uku Raudvere, 2019)to perform the above gene list enrichment analysis (version e102_eg49_p15_7a9b4d6).

Thus we will be using the following annotation database in the enrichment:

  • GO biological process annotation release 2021-02-01 (with Ensembl version 102)
  • Reactome version 75 (December 2020)
  • Wikipathways (December 2020 release)

All electronic GO annotations were discarded and the annotation term size was restricted to 200 genes.

The tool uses Fisher Exact test. We used 0.05 threshold for significance. In our list of up and down regulated genes we could filter for only 4 members of HTR gene family, since we had a strict standard to cut out any mapped genes with less than 9 reads. Thus we have now decided to use Benjamni-Hochberg correction for multiple hypothesis testing due to its less stringent profile compared to Bonferoni correction.

3.3.1 Results for non-remitters VS controls analysis in up-regulated genes

Fig 1, ORA results non-remitters VS controls in up-regulated genes Fig 1 - Legend: ORA results non-remitters VS controls in up-regulated genes

Fig 2 ORA with Wiki Pathways results for non-remitters VS controls in up-regulated genes Fig 2 - Legend: ORA with Wiki Pathway results for non-remitters VS controls in up-regulated genes, here we notice noticed all 31 results over the 0.05 threshold and terms of maximum size of 200.

Fig 3 ORA with reactome results for non-remitters VS controls in up-regulated genes Fig 3 - Legend: ORA with Reactome results for non-remitters VS controls in up-regulated genes, here we notice the most significant results of a total of 89 results over the 0.05 threshold with terms of maximum size of 200.

Fig 4 ORA with GO results for non-remitters VS controls in up-regulated genes Fig 4 - Legend: ORA with GO results for non-remitters VS controls in up-regulated genes, here we notice noticed the 35 most significant entries of a total of 467 results over the 0.05 threshold and terms of maximum size of 200.

3.3.2 Results for non-remitters VS controls analysis in down-regulated genes

Fig 5, ORA results non-remitters VS controls in down-regulated genes Fig 5 - Legend: ORA results non-remitters VS controls in down-regulated genes

Fig 6 ORA with Wiki Pathways results for non-remitters VS controls in down-regulated genes Fig 6 - Legend: ORA with Wiki Pathway results for non-remitters VS controls in down-regulated genes, here we notice noticed all 26 results over the 0.05 threshold and terms of maximum size of 200.

Fig 7 ORA with reactome results for non-remitters VS controls in down-regulated genes Fig 7 - Legend: ORA with Reactome results for non-remitters VS controls in down-regulated genes, here we notice the most significant results of a total of 139 results over the 0.05 threshold with terms of maximum size of 200.

Fig 8 ORA with GO results for non-remitters VS controls in down-regulated genes Fig 8 - Legend: ORA with GO results for non-remitters VS controls in down-regulated genes, here we notice noticed the most significant entries of a total of 89 results over the 0.05 threshold and terms of maximum size of 200.

3.3.3 Results for non-remitters VS controls analysis for all genes

Fig 9, ORA results non-remitters VS controls in all genes Fig 9 - Legend: ORA results non-remitters VS controls in all-regulated genes

Fig 10 ORA with Wiki Pathways results for non-remitters VS controls in all Fig 10 - Legend: ORA with Wiki Pathway results for non-remitters VS controls in all genes, here we notice noticed the most significant of the 175 results over the 0.05 threshold and terms of maximum size of 200.

Fig 11 ORA with reactome results for non-remitters VS controls in all genes Fig 11 - Legend: ORA with Reactome results for non-remitters VS controls in all genes, here we notice the most significant results of a total of 601 results over the 0.05 threshold with terms of maximum size of 200.

Fig 12 ORA with GO results for non-remitters VS controls in all genes Fig 12 - Legend: ORA with GO results for non-remitters VS controls in all genes, here we notice noticed the most significant entries of a total of 962 results over the 0.05 threshold and terms of maximum size of 200.

3.3.4 Results for non-remitters VS remitters analysis in up-regulated genes

Fig 13, ORA results non-remitters VS remitters in up-regulated genes

Fig 13 - Legend: ORA results non-remitters VS remitters in up-regulated genes

Fig 14 ORA with Wiki Pathways results for non-remitters VS remitters in up-regulated genes

Fig 14 - Legend: ORA with Wiki Pathway results for non-remitters VS remitters in up-regulated genes, here we notice noticed all 12 results over the 0.05 threshold and terms of maximum size of 200.

Fig 15 ORA with reactome results for non-remitters VS remitters in up-regulated genes

Fig 15 - Legend: ORA with Reactome results for non-remitters VS remitters in up-regulated genes, here we notice all 28 results over the 0.05 threshold with terms of maximum size of 200.

Fig 16 ORA with GO results for non-remitters VS remitters in up-regulated genes

Fig 16 - Legend: ORA with GO results for non-remitters VS remitters in up-regulated genes, here we notice noticed the most significant entries of a total of 110 results over the 0.05 threshold and terms of maximum size of 200.

3.3.5 Results for non-remitters VS remitters analysis in down-regulated genes

Fig 17, ORA results non-remitters VS remitters in down-regulated genes Fig 17 - Legend: ORA results non-remitters VS remitters in down-regulated genes

Fig 18 ORA with Wiki Pathways results for non-remitters VS remitters in down-regulated genes Fig 18 - Legend: ORA with Wiki Pathway results for non-remitters VS remitters in down-regulated genes, here we notice noticed all 3 results over the 0.05 threshold and terms of maximum size of 200.

Fig 19 ORA with reactome results for non-remitters VS remitters in down-regulated genes Fig 19 - Legend: ORA with Reactome results for non-remitters VS remitters in down-regulated genes, here we notice no result passed the 0.05 threshold with terms of maximum size of 200.

Fig 20 ORA with GO results for non-remitters VS remitters in down-regulated genes Fig 20 - Legend: ORA with GO results for non-remitters VS remitters in down-regulated genes, here we notice noticed all the 9 results over the 0.05 threshold and terms of maximum size of 200.

3.3.6 Results for non-remitters VS remitters analysis for all genes

Fig 13, ORA results non-remitters VS remitters in all genes Fig 13 - Legend: ORA results non-remitters VS remitters in all-regulated genes

Fig 14 ORA with Wiki Pathways results for non-remitters VS remitters in all Fig 14 - Legend: ORA with Wiki Pathway results for non-remitters VS remitters in all genes, here we notice noticed the most significant of the 123 results over the 0.05 threshold and terms of maximum size of 200.

Fig 15 ORA with reactome results for non-remitters VS remitters in all genes Fig 15 - Legend: ORA with Reactome results for non-remitters VS remitters in all genes, here we notice the most significant results of a total of 626 results over the 0.05 threshold with terms of maximum size of 200.

Fig 16 ORA with GO results for non-remitters VS remitters in all genes Fig 16 - Legend: ORA with GO results for non-remitters VS remitters in all genes, here we notice noticed the most significant entries of a total of 981 results over the 0.05 threshold and terms of maximum size of 200.

4. Interpretation

The forebrain neurons used in the study, as discussed in A1, were developed through a dedifferentiation technique which induce the formation pluripotent stem cells (iPSCs) from skin fibroblast cells. That technique is related to up and down regulation of several groups of genes that will be reflected on the results at section 3. Many of the results, especially from GO annotation correlates with literature on the dedifferentiation technique (Soliman MA, 2017)(Brennand KJ, 2011).

Even after taking that into consideration, the number of possible functional pathways elicited by the enrichment analysis revealed, especially when using reactome and Wiki Pathways annotations, that the number of over represented sets on the comparison of non-remitters and controls is much higher than at the comparison of non-remitters and remitters, which is also compatible with the source study(Vadodaria et al., 2019), but also in literature comparing pharmacological-response phenotype subgroups among those affected by MDD (Mrazek DA, 2014)(Drysdale AT, 2017).

Important to notice is the over-representation of eye development pathways in neurons exposed to SSRI, a finding that strongly correlate to 5-HT role in retinal process(George et al., 2005), and should therefore impair the viability of hypothesis based on its differential expression as markers of MDD itself.

The most interesting finding from the current thresholded enrichment analysis, even after pondering over the limitations above, is the presence of over-represented sets on opiate receptor family and monoamine transport pathways, the later being confirmed by the source literature(Vadodaria et al., 2019). The role of endogenous opiate in SSRI-resistant MDD patients may be, at least preliminary, good targets for further consideration, which I hope to undertake during A3 and its ranked approach to enrichment analysis.

5. Summary answers for key questions

5.1 How many genes were significantly differentially expressed? What thresholdsdid you use and why?

Using the main paper’s modeling (please see section 2.1.3), 2306 were differentially expressed for the non-remitters VS controls analysis and 732 for the remitters VS non-remitters analysis. For the other models please see section 2.1.1 and 2.1.2.

The threshold used was 0.05 due to its widely accepted marker of significance although an analysis with 0.09 for the paper’s model was also used in section 2.1.3 due to the fact that the source paper uses it for one of it tests (please see details at 2.1.3)

5.2 Regarding multiple hypothesis correction method, which method did you use? And Why? How many genes passed correction?

Please see section 2 for details.

We used Benjamni-Hochberg correction for multiple testing.

Reason: In A1, we could filter for only 4 members of HTR gene family (which is relevant for the main paper), since we had a strict standard to cut out any mapped genes with less than 9 reads. Thus we have now decided to use Benjamni-Hochberg correction for multiple hypothesis testing due to its less stringent profile compared to Bonferoni correction.

No gene passed the correction, prompting for the use of functional enrichment analysis.

5.3 Show the amount of differentially expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest.

Please see section 2.2

5.4 Visualize your top hits using a heatmap. Do you conditions cluster together? Explain why or why not.

Please see section 2.3. Conditions intentionally cluster together due to the order of processing, despite cluster argument being set to FALSE. (uncomment related chunk to uncluster, please see details on section 2.3)

5.5 Regarding ORA, which method did you choose and why?

Please see section 3.1 and 3.2 for details.

We used Fisher Exact test on over-representation analysis through the G:Profiler web server(Uku Raudvere, 2019) (version e102_eg49_p15_7a9b4d6). We used this method for being appropriate to our thresholded list and easier to automate since it is built in at the G:Profiler tool for functional enrichment analysis.

We use Benjamni-Hochberg correction for multiple hypothesis testing due to its less stringent profile compared to Bonferoni correction

5.5 What annotation data did you use and why? What version of the annotation are you using?

Please see section 3.3 for details.

  • GO biological process annotation release 2021-02-01 (with Ensembl version 102)
  • Reactome version 75 (December 2020)
  • Wikipathways (December 2020 release) All electronic GO annotations were discarded and the annotation term size was restricted to 200 genes.

5.6 How many genesets were returned with what thresholds?

Please see section 3.3 for details.

ORA with Wiki Pathway results for non-remitters VS controls in up-regulated genes:

  • 31 results over the 0.05 threshold and terms of maximum size of 200.

ORA with Reactome results for non-remitters VS controls in up-regulated genes:

  • 89 results over the 0.05 threshold with terms of maximum size of 200.

ORA with GO results for non-remitters VS controls in up-regulated genes:

  • 467 results over the 0.05 threshold and terms of maximum size of 200.

ORA with Wiki Pathway results for non-remitters VS controls in down-regulated genes:

  • 26 results over the 0.05 threshold and terms of maximum size of 200.

ORA with Reactome results for non-remitters VS controls in down-regulated genes:

  • 139 results over the 0.05 threshold with terms of maximum size of 200.

ORA with GO results for non-remitters VS controls in down-regulated genes:

  • 89 results over the 0.05 threshold and terms of maximum size of 200.

ORA with Wiki Pathway results for non-remitters VS controls in all genes:

  • 175 results over the 0.05 threshold and terms of maximum size of 200.

ORA with Reactome results for non-remitters VS controls in all genes:

  • 601 results over the 0.05 threshold with terms of maximum size of 200.

ORA with GO results for non-remitters VS controls in all genes:

  • 962 results over the 0.05 threshold and terms of maximum size of 200.

ORA with Wiki Pathway results for non-remitters VS remitters in up-regulated genes:

  • 12 results over the 0.05 threshold and terms of maximum size of 200.

ORA with Reactome results for non-remitters VS remitters in up-regulated genes:

  • 28 results over the 0.05 threshold with terms of maximum size of 200.

OORA with GO results for non-remitters VS remitters in up-regulated genes:

  • 110 results over the 0.05 threshold and terms of maximum size of 200.

ORA with Wiki Pathway results for non-remitters VS remitters in down-regulated genes:

  • 3 results over the 0.05 threshold and terms of maximum size of 200.

ORA with Reactome results for non-remitters VS remitters in down-regulated genes:

  • no result passed the 0.05 threshold with terms of maximum size of 200.

ORA with GO results for non-remitters VS remitters in down-regulated genes:

  • 9 results over the 0.05 threshold and terms of maximum size of 200.

ORA with Wiki Pathway results for non-remitters VS remitters in all genes:

  • 123 results over the 0.05 threshold and terms of maximum size of 200.

ORA with Reactome results for non-remitters VS remitters in all genes:

  • 626 results over the 0.05 threshold with terms of maximum size of 200.

ORA with GO results for non-remitters VS remitters in all genes:

  • 981 results over the 0.05 threshold and terms of maximum size of 200.

5.7 How do these results compare to using the whole list

The results on up-regulated and down-regulated were much fewer but much more significant the the ones yielded by ORA on the whole list, compatible with what was expected. Please see section 4 for details.

5.8 Do the over-representation results support conclusions or mechanism discussed in the original paper?

Yes, please see details in section 4.

6. References

Brennand KJ, et a. (2011). Modelling schizophrenia using human induced pluripotent stem cells. Nature, 473, 221–225.

Davis, S., & Meltzer, P. (2007). GEOquery: A bridge between the gene expression omnibus (geo) and bioconductor. Bioinformatics, 14, 1846–1847.

Drysdale AT, et a. (2017). Resting-state connectivity biomarkers define neurophysiological subtypes of depression. Nat Med., 23, 28–38.

Gene expression omnibus. (n.d.). In National Center for Biotechnology Information. U.S. National Library of Medicine. https://www.ncbi.nlm.nih.gov/geo/

George, A., Schmid, K. L., & Pow, D. V. (2005). Retinal serotonin, eye growth and myopia development in chick. Experimental Eye Research, 81(5), 616–625.

Gu, Z., Eils, R., & Schlesner, M. (2016). Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics.

Gu, Z., Gu, L., Eils, R., Schlesner, M., & Brors, B. (2014). Circlize implements and enhances circular visualization in r. Bioinformatics, 30(19), 2811–2812.

Morgan, M. (2019). BiocManager: Access the bioconductor project package repository. https://CRAN.R-project.org/package=BiocManager

Mrazek DA, et a. (2014). Treatment outcomes of depression: The pharmacogenomic research network antidepressant medication pharmacogenomic study. J Clin Psychopharmacol., 34, 313–317.

Müller, K., Wickham, H., James, D. A., & Falcon, S. (2020). RSQLite: ’SQLite’ interface for r.

Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., & Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47. https://doi.org/10.1093/nar/gkv007

Soliman MA, Z. N., Aboharb F. (2017). Pluripotent stem cells in neuropsychiatric disorders. Mol Psychiatry, 22, 1241–1249.

Uku Raudvere, I. K., Liis Kolberg. (2019). G:Profiler: A web server for functional enrichment analysis and conversions of gene lists.

Vadodaria, K. C., Ji, Y., Skime, M., Paquola, A., Nelson, T., Hall-Flavin, D., Fredlender, C., Heard, K. J., Deng, Y., Le, A. T., & others. (2019). Serotonin-induced hyperactivity in ssri-resistant major depressive disorder patient-derived neurons. Molecular Psychiatry, 24(6), 795–807.

Vadodaria KC, G. F. (2019). Serotonin-induced hyperactivity in ssri-resistant major depressive disorder patient-derived neurons. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE125664

Xie, Y. (2021). Knitr: A general-purpose package for dynamic report generation in r. https://yihui.org/knitr/

Zhu, Y., Davis, S., Stephens, R., Meltzer, P. S., & Chen, Y. (2008). GEOmetadb: Powerful alternative search engine for the gene expression omnibus. Bioinformatics (Oxford, England), 24(23), 2798–2800. https://doi.org/10.1093/bioinformatics/btn520